Abstract. We approach the theoretical problem of compressing a signal dominated by Gaussian noise. We
present expressions for the compression ratio which can be reached, in the light of Shannon's noiseless
coding theorem, for a linearly quantized stochastic Gaussian signal (noise). The compression ratio decreases
logarithmically with the amplitude of the frequency spectrum $P(f)$ of the noise. Entropy values and compression
rates are shown to depend on the shape of this power spectrum, given different normalizations. The cases of
white noise (w.n.), $f^{n_p}$ power-law noise (including $1/f$ noise), (w.n.+$1/f$) noise, and piecewise
(w.n.+$1/f$ | w.n.+$1/f^2$) noise are discussed, while quantitative behaviours and useful approximations are provided.
Keywords: Information Theory, Signal compression 

1 Introduction 

There are several motivations to consider the theoretical problem of compressing noise (or signals so stochastic
that they deserve this name). In some cases, the signal to be transmitted is intrinsically noisy (e.g. from scientific
measurements) and needs to be compressed in a lossless way before any reduction process can be applied. One of
the measured quantities which best exhibits this intrinsic randomness is the fluctuation of the cosmic microwave
background (CMB) radiation. Considerable efforts have already been made in order to cope with the handling
of this sort of data (see e.g. [ref.]). Like other signals from scientific instruments on board space satellites,
CMB measurements produce high rates of noisy data that have to be sent to Earth via a more or less limited
telemetry rate [ref.].

Electronic instruments (e.g. detectors, amplifiers) show characteristic low-frequency instabilities ($1/f$
noise) to be added to white or thermal noise. When the signal measured with these instruments is weak, it can
only be recovered by averaging many measurements. The averaging is possible only after a careful calibration
of the low-frequency instabilities, which in practice means that the whole (noisy) signal has to be transmitted
(to Earth). This is an example that requires lossless compression of a signal dominated by noise. In the present
work we would like to study, in a quantitative way, to what extent noise can be compressed.

This noise is usually treated as a Gaussian stochastic process with an arbitrary power spectrum (some
relevant aspects of this type of process have been considered in [ref.]). We shall assume that its values are
discretized (quantized) in a uniform or linear way. Given the properties of a Gaussian distribution, it is
possible to find analytical approximations for its information content, and we will take advantage of them for
obtaining the ideal (i.e., highest theoretically achievable) compression factor.

In the present work we make no reference to the error brought about by the discretization process itself.
Yet, a few words on this subject are perhaps called for. A typical measure of the error caused is the distortion
$D$ (many aspects of rate distortion theory are covered in [ref.]. The same philosophy has been applied to the
minimum discrimination information (MDI) theory; see e.g. [ref.] and refs. therein). This magnitude is a sum
(or integral) of the error between the continuous values of the initial random variable and the associated
discrete ones, weighted by the probability distribution. According to the theory, for a given initial length of the
random variable there is a minimal possible distortion. Then, $\sqrt{D}$ may be interpreted as the lowest 'distortion
noise'. Usually, this optimal $D$ falls as the length increases, but a longer length means a larger entropy, thus setting
a trade-off between compressibility and distortion.

The text is organized as follows. In Section 2 we present a basic introduction to the problem of data 
compression. In Section 3 we deal with the one-dimensional Gaussian case, which will be helpful for studying 
multidimensional Gaussian noise with possible correlations in Section 4. Information content and compression 
values are then discussed. In Section 5 our conclusions are presented. A number of calculations have been 
included in the appendices. 

2 The basic data compression problem 

Standard lossless data compression techniques are applied successfully only to data sets with some redundancy.
This redundancy can be formally expressed using the entropy H. It is easy to show (see below) that it is 
not possible to compress a (uniformly) random distribution of measurements. If noise is discretized to a high 
resolution (as compared to its variance) the resulting distribution of numbers approaches a uniform distribution. 
This indicates that lossless compression might not be very efficient when the data is dominated by noise, but, 
as we shall see, the problem depends crucially on the digital resolution and the range of values to be stored. 

Hypothetical data compression problems can be considered in the light of Shannon's first theorem (see
[ref.]). This theorem tells us that the Shannon entropy $H$ of a source is the lower bound to the average length
of the code units or 'words'. (In addition, we know that such a lower bound can be fairly well approached by
means of some of the available methods for coding, such as Huffman's, etc.) Then, the theoretical compression
rate is defined as:

$$ c_{r,\mathrm{opt}} \equiv \frac{\text{average length per code unit}}{\text{Shannon entropy per code unit}} $$

Of course, for this quotient to make sense, both quantities should be referred to the same type of code divisions
(e.g. words, data values, blocks, packets, etc.) and must be written in the same length units (e.g. bits).

Thus, our problem entails the entropy of the stochastic process generating the noise under consideration.
In our case, this noise will be the result of a Gaussian process with a specific power spectrum. Its outcome shall
be represented by a random variable $\eta$, which can be assumed to be stationary in the wide sense. The discrete set of
$\eta(t)$-values for successive $t$ increases will be treated like the components of a multidimensional Gaussian variable
with the power spectrum in question. Most of the time, we will deal with a bandwidth-limited spectrum, i.e.,
one where the frequencies are limited by an upper and a lower limit. Examining the associated Shannon entropy,
we shall study the hypothetical chances of compressing the sort of data sequences generated by such processes. In
particular, we will consider Gaussian white noise, Gaussian noise with correlation of the $1/f$ type, and Gaussian
noise with a mixed correlation of the type white noise + $1/f$ noise.

In general, the compression rate $c_r$ for finite sequences of symbols that have been encoded is usually
defined as the quotient between the sequence lengths before and after the encoding process, $\mathcal{L}_i$ and
$\mathcal{L}_f$ respectively, i.e., $c_r = \mathcal{L}_i/\mathcal{L}_f$. If $\{a_j\}$ and $\{\alpha_j\}$
($j = 1, \ldots, \mathcal{N}_s$) denote the initial and final (or encoded) sets of symbols, their average lengths are

$$ L_i = \sum_{j=1}^{\mathcal{N}_s} p_j\, L(a_j), \qquad L_f = \sum_{j=1}^{\mathcal{N}_s} p_j\, L(\alpha_j), $$

where $p_j$, $L(a_j)$ and $L(\alpha_j)$ give the probability of the $j$th symbol and its length in bits before and after encoding,
respectively. When the sequences are long enough, the rate $c_r$ can be replaced with the quotient between the
initial and final average lengths per symbol in the way $c_r \simeq L_i/L_f$. We shall assume $L(a_j) = L_i\ \forall j$, i.e., that the
initial data representation consists of symbols of the same length.

Shannon's first theorem (also called noiseless coding theorem, see e.g. [11]) provides theoretical lower
(and upper) bounds to the final length per symbol in the way $H \le L_f\ (< H + 1)$, where $H$ is the Shannon
entropy

$$ H = -\sum_j p_j \log_2(p_j). \tag{2.1} $$

An efficient coding method will have to approach equality to the lower bound. For one dimension, the Huffman
scheme is known to be reasonably close (see footnote 1, and also the performance of other methods such as the Rice algorithm
in [11]). Thus, the compression ratio will satisfy $c_r \simeq L_i/L_f \le L_i/H$, equality being the optimal case, given by

$$ c_{r,\mathrm{opt}} = \frac{L_i}{H}. \tag{2.2} $$

Let's consider the case of an $N$-dimensional (vector) random variable. Since the probabilities must
now be referred to a multivariate distribution, (2.1) is generalized to

$$ H_N = -\sum_{j_1,\ldots,j_N} p_{j_1 \ldots j_N} \log_2\left(p_{j_1 \ldots j_N}\right). \tag{2.3} $$

We shall suppose that each of its components is a one-dimensional random variable of the same type. In
addition, there might exist possible correlations among these components. There is a well-known inequality for
any $N$-dimensional random variable $\vec\eta = (\eta_1, \ldots, \eta_N)$ (Gaussian or not) relating the joint Shannon entropy $H_N$
and the individual Shannon entropies of each component, $H_1(\eta_j)$, $j = 1, \ldots, N$, which reads

$$ H_N(\eta_1, \ldots, \eta_N) \le H_1(\eta_1) + \cdots + H_1(\eta_N), \tag{2.4} $$

or, equivalently,

$$ h(\eta_1, \ldots, \eta_N) \equiv \frac{H_N}{N} \le \frac{1}{N} \sum_{j=1}^{N} H_1(\eta_j), \tag{2.5} $$

where $h$ denotes the joint Shannon entropy per component. Unlike $H_N$, $h$ does not grow extensively by merely
increasing $N$. When $\eta_1, \ldots, \eta_N$ are all of the same type, (2.5) reduces to $h \le H_1$. Defining the initial
length per component $l_i$ as the analogue of $L_i$ for each vector component, eq. (2.2) may be rewritten as

$$ c_{r,\mathrm{opt}} = \frac{l_i}{h}. \tag{2.6} $$

It is essential to note that the equality in (2.5) is satisfied if and only if the $N$ components $\eta_1, \ldots, \eta_N$ are
independent. Therefore, for independent variables of the same type, $h = H_{N=1}$, and it is enough to study the
$N = 1$ case.

1 To give an idea of this closeness, let's quote a bound found in [12]: calling $r = L_f - H$ and $p_{\max} = \max(\{p_j\})$, then
$r \le p_{\max} + \log_2\left(\frac{2\log_2(e)}{e}\right) = p_{\max} + 0.086$.

Observe that for a uniform distribution, where $p_j = 1/\mathcal{N}_s$, we have that $h = l_i = \log_2(\mathcal{N}_s)$; so, no
compression is possible ($c_{r,\mathrm{opt}} = 1$).
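As a small numerical illustration of eqs. (2.1), (2.2) and of the uniform-distribution remark above, the following sketch evaluates the entropy and the ideal rate for two toy sources. The function names and the four-symbol probabilities are assumptions made only for this example; they do not come from the text.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits, eq. (2.1); zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def optimal_rate(initial_bits, probs):
    """Ideal compression rate of eq. (2.2): initial length per symbol over entropy."""
    return initial_bits / shannon_entropy(probs)

# Hypothetical 4-symbol sources, both stored with 2 bits per symbol.
skewed = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4

print("entropy of skewed source :", shannon_entropy(skewed))
print("ideal rate, skewed source:", optimal_rate(2, skewed))
print("ideal rate, uniform      :", optimal_rate(2, uniform))
```

The uniform source gives a rate of exactly one, in line with the statement that no compression is possible in that case.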



3 One-dimensional Gaussian variable 

We will try to find this theoretical rate for a zero-mean Gaussian white noise $\eta$, whose probability density will
be called $f(\eta)$, with variance equal to $\sigma^2$, and whose values are discretized or 'quantized' to a given resolution.
When discretizing, we gather results into intervals of some fixed width, which shall be denoted by $\Delta\eta$. If this
width is small enough, we may assume that all the values that have fallen into the same interval have, roughly,
the same probability. Thus, to each interval we assign a 'probability' value as follows:

$$ p(x)\,\Delta\eta \equiv \mathrm{Prob}\left(\eta \text{ in the interval around } x,\ \left[x - \tfrac{\Delta\eta}{2},\ x + \tfrac{\Delta\eta}{2}\right]\right) \simeq f(x)\,\Delta\eta = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}\, \Delta\eta. $$

This will be done for each $\eta^{(j)}$, with $\eta^{(j)} = j\,\Delta\eta$, $j \in \mathbb{Z}$. Each interval will be called
$I_j \equiv \left[\eta^{(j)} - \tfrac{\Delta\eta}{2},\ \eta^{(j)} + \tfrac{\Delta\eta}{2}\right]$.
In order to properly talk about probabilities, the set should be well normalized. Therefore, we write the
probability that $\eta$ takes a value in $I_j$ as

$$ p_j = \frac{1}{Z}\, e^{-j^2 (\Delta\eta)^2/(2\sigma^2)}, \tag{3.1} $$
where

$$ Z = \sum_{n=-\infty}^{\infty} e^{-n^2 (\Delta\eta)^2/(2\sigma^2)}. \tag{3.2} $$

This $Z$, introduced in order to fulfil the normalization condition, may also be regarded as the partition function
of a system with energies $\{E_n = \pi n^2\}$ at temperature $T = 2\pi\sigma^2/(\Delta\eta)^2$, with adequate new units for the Boltzmann
constant.



In order to calculate the ideal compression rate, we need to find the Shannon entropy (2.1). Since $N = 1$,
$H = H_1 = h$, and the result (see subsec. A.1 in the appendix) is

$$ h = \log_2\left(\sqrt{2\pi e}\,\frac{\sigma}{\Delta\eta}\right) + O\left(\frac{\sigma^2}{(\Delta\eta)^2}\, e^{-2\pi^2\sigma^2/(\Delta\eta)^2}\right), \tag{3.3} $$

which depends on $\sigma$ and $\Delta\eta$ only through the dimensionless quotient

$$ \Delta \equiv \frac{\Delta\eta}{\sigma}. \tag{3.4} $$

Thus, the smaller $\Delta \sim 1/\sqrt{T}$ (the higher the temperature), the larger the entropy $h$. Compare this with the
result of a naive integration without discretization, which would be $\log_2\left(\sqrt{2\pi e}\,\sigma\right) \equiv h_{\mathrm{cont}}$. In the $\Delta \to 0$ limit
the exponentially small corrections vanish, but the logarithm of $\Delta$ diverges. Thus,

$$ h \simeq h_{\mathrm{cont}} - \log_2(\Delta\eta) \tag{3.5} $$

(see the explanation in refs. [ref.] or [15], or in the appendix, or our own comments below, after eq. (4.11)).



Let's now write the initial mean length as $l_i = \log_2(\mathcal{N}_s)$. This means that, using a suitable binary
representation, $\mathcal{N}_s$ is the number of effectively distinct $\eta$-values that can be considered (although $l_i$ is an integer
only when $\mathcal{N}_s$ is an exact power of 2, these variables will be treated as if they were real).

First, we can imagine a process in which the initial length per symbol $l_i$ has been fixed independently of
$\Delta\eta$ (this could be the case when we are worried about instabilities of the signal). Then, the optimal compression
rate would just be the quotient

$$ c_{r,\mathrm{opt}} \simeq \frac{l_i}{\log_2\left(\sqrt{2\pi e}/\Delta\right)}. \tag{3.6} $$

So the larger we can make $\Delta$ without loss of relevant information, the larger the compression. If the final
sensitivity $S$ we need is obtained from some later average of $M$ measurements of this noise $\eta$, then we can make
$\Delta \sim 1$ as long as $M > (\sigma/S)^2$. In this extreme case the compression can be as large as $c_{r,\mathrm{opt}} = l_i/2.047$, e.g.
$c_{r,\mathrm{opt}} = 7.8$ for 16-bit symbols. Fig. 1 shows (as continuous lines) the entropy $h$ and the compression $c_{r,\mathrm{opt}}$ as
a function of $\Delta$.

Another possibility is to work with $l_i$ as a function of $R$ and $\Delta\eta$. We suppose that the values of our
random variable $\eta$ span a range $R = \max(\eta) - \min(\eta)$. Assuming our discretization to be linear, it is clear that

$$ R = \mathcal{N}_s\, \Delta\eta \tag{3.7} $$
Figure 1: Shannon entropy per component $h$ (left) and associated optimal compression rate $c_{r,\mathrm{opt}}$ (right), the latter by
formula (3.6), as functions of the discretization parameter $\Delta = \Delta\eta/\sigma$, for a fixed $l_i = 16$ bits. The three
curves correspond to $n_p = 0$ with $P(\omega) = A = 1$ sec. (solid line), a combination of $n_p = 0$ and $n_p = -1$ of the
form $P(\omega) = A(A_0 + \omega_0/|\omega|)$ with $A_0 = 1$, $\omega_0 = \omega_{\mathrm{Max}}/10$ (dashed line), and to $n_p = -1$ for $P(\omega) = A\,\omega_0/|\omega|$
with the same value of $\omega_0$ (dotted line). No equal-$\sigma_{1p}$ constraint has been imposed.



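and, therefore,

$$ l_i = \log_2(\mathcal{N}_s) = \log_2\left(\frac{R}{\Delta\eta}\right). \tag{3.8} $$

Formulae (2.6) and (3.3)-(3.8) enable us to put $c_{r,\mathrm{opt}}$ as a function of either $\Delta\eta$ or $l_i$. Then,

$$ c_{r,\mathrm{opt}} \simeq \frac{\log_2(R) - \log_2(\Delta\eta)}{\log_2\left(\sqrt{2\pi e}\,\sigma/\Delta\eta\right)} = \frac{l_i}{l_i - \log_2(R) + \log_2\left(\sqrt{2\pi e}\,\sigma\right)} = \frac{l_i}{l_i + \log_2\left(\sqrt{2\pi e}\,\sigma/R\right)}. \tag{3.9} $$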



If we limit $R$ to a given number of $\sigma$'s (say $N_0$) around the origin, only the values in $(-N_0\sigma, N_0\sigma)$ will be
taken into consideration. Thus, $R = 2N_0\sigma$, and we can further write

$$ c_{r,\mathrm{opt}} \simeq \frac{l_i}{l_i + \log_2\left(\frac{\sqrt{2\pi e}}{2N_0}\right)} = \left[1 + \frac{\log_2\left(\frac{\sqrt{2\pi e}}{2N_0}\right)}{\log_2\left(\frac{2N_0\,\sigma}{\Delta\eta}\right)}\right]^{-1}. \tag{3.10} $$

Note that $c_{r,\mathrm{opt}}$ cannot be larger than one if $N_0 < N_{0\,\mathrm{crit}} = \sqrt{2\pi e}/2 \simeq 2.0664$. This is interpreted as a critical
size of the acceptable range. On the other hand, by taking larger and larger values of $N_0$ one could achieve
arbitrarily high compression rates, but this would mean collecting sufficiently meaningful amounts of data very
far from the mean. This could correspond to rare events which might not follow the Gaussian distribution (see footnote 2).

2 Note that there is no contradiction here because even in the presence of non-Gaussian rare events, the bulk of the data might
still be well described by a Gaussian, so that our estimations could still yield a good approximation.
Figure 2: Optimal compression rate $c_{r,\mathrm{opt}}$, by formula (3.9), as a function of the discretization interval
$\Delta = \Delta\eta/\sigma$ (left) and as a function of the initial mean length in bits $l_i$ (right). The three curves have the same
parameters as in Fig. 1 but, this time, with variable $l_i$ and $R = 2N_0\sigma$, $N_0 = 3$. Note that $c_{r,\mathrm{opt}}$ has a divergence
for small $l_i$, which comes from the vanishing of $h$.



In general, a reasonable choice would be some $N_0$ moderately above $N_{0\,\mathrm{crit}}$, but this depends critically on the
subsequent analysis we want to carry out with these data.

In Figure 2 we show $c_{r,\mathrm{opt}}$ for the case of white noise (continuous line) with $R = 2N_0\sigma$, $N_0 = 3$. The main
difference from Fig. 1 is that there $l_i = 16$ bits, while in Fig. 2 we choose $l_i$ according to $\Delta\eta$ as in eq. (3.8)
with $N_0 = 3$. Although the distance of three sigmas is already a long way from the mean, the compression
rates found are rather small. In Fig. 2 we also show (right panel) $c_{r,\mathrm{opt}}$ as a function of $l_i$, showing how the
compressibility increases as $l_i$ gets small.
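The range-based rate of eqs. (3.8)-(3.10) and the critical value $N_{0\,\mathrm{crit}}$ are easy to tabulate. The following is a minimal sketch using only the leading term of (3.3); the function name and the sample values of $\Delta$ and $N_0$ are assumptions chosen for illustration.

```python
import math

SQRT_2PIE = math.sqrt(2 * math.pi * math.e)

def rate_from_range(delta, n0):
    """Optimal rate (3.10) for a range R = 2*N0*sigma and step delta = Δη/σ."""
    l_i = math.log2(2 * n0 / delta)    # eq. (3.8): l_i = log2(R / Δη)
    h = math.log2(SQRT_2PIE / delta)   # leading term of eq. (3.3)
    return l_i / h

n0_crit = SQRT_2PIE / 2                # critical half-range in units of sigma
print("N0_crit =", n0_crit)

for n0 in (2, 3, 5):
    print(n0, [round(rate_from_range(d, n0), 3) for d in (0.05, 0.1, 0.5)])
```

For $N_0 = 2$ the rate stays below one, while for $N_0 = 3$ it is only slightly above one at fine resolution, matching the remark that three sigmas still yields rather small compression rates.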



4 Multidimensional case: Gaussian stochastic processes 
4.1 Uncorrelated Gaussian variables: white noise 

Suppose now that we have $N$ uncorrelated Gaussian variables with different variances $\sigma_1^2, \ldots, \sigma_N^2$. Although we
shall keep this general notation, we are only interested in processes where $\sigma_1 = \cdots = \sigma_N$, which is the case of a
Gaussian stochastic process stationary in the wide sense. As long as these variables are uncorrelated, we have to
apply (2.4) as an equality, which, combined with (3.3) and equally quantizing in all dimensions, gives the joint
entropy

$$ H = H_N = \log_2\left[\left(\frac{\sqrt{2\pi e}}{\Delta\eta}\right)^{N} \prod_{m=1}^{N} \sigma_m\right] + \text{exponentially small terms}. \tag{4.1} $$

Here we may interpret

$$ \prod_{m=1}^{N} \sigma_m^2 = \mathrm{Det}(C_0), \tag{4.2} $$
[Table 1: column headers and most row labels are not recoverable. Surviving entries, one row per line:]

  ...    ...    1.81
  2.34   2.36   2.37
  2.81   2.85   2.85
  3.27   3.31   3.32
  3.67   3.69   3.80    (row labelled 1.0)

Table 1: Comparison of optimal compression rates $c_{r,\mathrm{opt}}$ with actual rates from simulated Gaussian white noise
compressed with implementations of the Huffman and arithmetic methods. These values are
roughly a half of those in the solid-line curve of Fig. 1, as $l_i$ is now equal to 8, instead of 16.



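where $C_0$ is the diagonal matrix

$$ C_0 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2). \tag{4.3} $$

It is useful to define an effective variance as:

$$ \sigma_0^2 \equiv \mathrm{Det}^{1/N}(C_0), \tag{4.4} $$

so that when $\sigma_1 = \cdots = \sigma_N$ we will have $\sigma_0 = \sigma_1 = \cdots = \sigma_N$, which also agrees with our definition of the
1-point sigma $\sigma_{1p}$ below eq. (4.22).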

Thus, the entropy per component is conveniently written as in the previous case (3.3):

$$ h_0 = \log_2\left(\sqrt{2\pi e}\,\frac{\sigma_0}{\Delta\eta}\right) + \text{exponentially small terms}, \tag{4.5} $$

where the 0-subscript means that this is the uncorrelated case. As we shall see below, dealing with correlations
will just mean replacing $C_0$ with a new correlation matrix, say $C$, in (4.4).

We have done simulations of Gaussian noise with $\sigma_1 = \ldots = \sigma_N$, and the data have been represented with
a fixed $l_i = 8$ bits. The $N$-dimensional variable is then compressed by the Huffman and arithmetic methods,
and the compression rate is found as the quotient between the sizes of the initial and the compressed files.
This actual compression rate is then compared to the optimal one, i.e., to $c_{r,\mathrm{opt}} = l_i/h$. The results are presented
in Table 1. The agreement is better as $N$ increases. The explanation is that, in practice, the compressed files
take up some further space for storing the conversion tables between both symbol sets. Obviously, since the
number of different symbols is fixed ($2^{l_i} = 256$), the relative contribution caused by the size of these tables
decreases as $N$ grows.
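The kind of comparison described above can be imitated on a small scale. The sketch below is not the simulation behind Table 1: it draws unit-variance Gaussian samples with Python's `random` module, quantizes them, builds Huffman code lengths with a textbook heap construction, and ignores the storage of the code table. All names and parameter values (`simulate`, `delta = 0.5`, 50 000 samples) are assumptions of this example.

```python
import heapq
import math
import random
from collections import Counter

def huffman_lengths(counts):
    """Return {symbol: code length in bits} of a Huffman code built from symbol counts."""
    if len(counts) == 1:
        return {s: 1 for s in counts}
    # Heap entries: (weight, tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to the merged symbols
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

def simulate(delta=0.5, n_samples=50_000, bits=8, seed=0):
    """Return (Huffman rate, empirical-entropy rate, theoretical rate l_i/h)."""
    rng = random.Random(seed)
    half = 2 ** (bits - 1)
    # Quantize in steps of delta = Δη/σ and clip to the signed range of `bits` bits.
    symbols = [max(-half, min(half - 1, round(rng.gauss(0.0, 1.0) / delta)))
               for _ in range(n_samples)]
    counts = Counter(symbols)
    lengths = huffman_lengths(counts)
    avg_len = sum(counts[s] * lengths[s] for s in counts) / n_samples
    h_emp = -sum(c / n_samples * math.log2(c / n_samples) for c in counts.values())
    h_theory = math.log2(math.sqrt(2 * math.pi * math.e) / delta)
    return bits / avg_len, bits / h_emp, bits / h_theory

huff_rate, emp_rate, opt_rate = simulate()
print("Huffman:", huff_rate, " empirical entropy:", emp_rate, " theory:", opt_rate)
```

Because table overhead is left out here, the Huffman figure only reflects the gap between the average code length and the entropy, not the file-size effect discussed in the text.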



4.2 Gaussian variables with correlation: coloured noise 



Now, suppose that we have an $N$-dimensional variable $\vec\eta = (\eta_1, \ldots, \eta_N)$ whose components are correlated
according to the entries of some covariance matrix $C$. By mathematical definition,
$C(\eta_j, \eta_k) = \langle (\eta_j - \bar\eta_j)(\eta_k - \bar\eta_k)^* \rangle$, where $\langle \ldots \rangle$ denotes statistical average, and $\bar\eta_j = \langle \eta_j \rangle$. In the case of zero-mean variables, it reduces to

$$ C(\eta_j, \eta_k) = \langle \eta_j\, \eta_k^* \rangle \equiv C_{jk} \tag{4.6} $$

(for the changes to be made when the mean is not zero, see sec. B.4). In practice, a discretization or shot
noise fluctuation could be added, and the theoretical correlation would then acquire an additional term on top of $\langle \eta_j\, \eta_k^* \rangle$. In
general, this is of little interest as it just amounts to a constant increase of the power spectrum. The values of
$\vec\eta$ can correspond to a continuous random variable $\eta = \eta(t)$ sampled at $N$ time intervals ($\eta_j = \eta(t_j)$). For a wide-sense
stationary stochastic process we have that $C_{jk} = C_{j-k}$ can only be a function of $j - k$, i.e., the covariance
matrix is a Toeplitz matrix.

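A sequence of a Gaussian stochastic process has a joint probability density given by

$$ f(\vec\eta) = \frac{1}{\sqrt{\mathrm{Det}(2\pi C)}}\, \exp\left(-\tfrac{1}{2}\, \vec\eta^{\,T} C^{-1}\, \vec\eta^{\,*}\right). $$

In the absence of correlations $C$ is just the $C_0$ of (4.3) and therefore $C^{-1} = \mathrm{diag}(1/\sigma_1^2, \ldots, 1/\sigma_N^2)$, but now we
expect the presence of nonvanishing off-diagonal coefficients. We may assume that all the $\vec\eta$ components are
real. Each dimension will be discretized in the same way as for the one-dimensional case. Therefore, we will
consider the joint probabilities

$$ p(\eta_1 \in I_{j_1}, \ldots, \eta_N \in I_{j_N}) = \frac{1}{Z}\, \exp\left[-\frac{(\Delta\eta)^2}{2}\, (j_1, \ldots, j_N)^T\, C^{-1}\, (j_1, \ldots, j_N)\right], \tag{4.7} $$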



where the normalizing quantity $Z$ is given by

$$ Z = \sum_{n_1, \ldots, n_N} \exp\left[-\frac{(\Delta\eta)^2}{2}\, (n_1, \ldots, n_N)^T\, C^{-1}\, (n_1, \ldots, n_N)\right]. \tag{4.8} $$



The ensuing Shannon entropy (see subsec. A.2 in the appendix) is

$$ H = H_N = \log_2 \sqrt{\mathrm{Det}\left(\frac{2\pi e}{(\Delta\eta)^2}\, C\right)} + \text{exponentially small terms}, \tag{4.9} $$

where the quantity $\alpha$ that controls these terms is given in section A.2. Note that the next-to-leading terms are, again, exponentially small, and
their typical size can be adequately expressed as a function of a dimensionless parameter $\Delta\eta/\sigma_e \equiv \Delta$. Before going
on, some comments are in order. The previous relation can be rewritten in the form

$$ H = \log_2 \sqrt{\mathrm{Det}(2\pi e\, C)} - N \log_2(\Delta\eta) + \text{exponentially small part}. \tag{4.10} $$



The first term on the r.h.s. is just the result of having calculated $H$ after replacing the multiple sum in (4.8)
with a multiple integral. Therefore, we shall call it $H_{\mathrm{cont}}$. Further, in the continuum limit, $\Delta \to 0$ and the
exponential corrections should vanish. This leads to

$$ H \simeq H_{\mathrm{cont}} - N \log_2(\Delta\eta). \tag{4.11} $$



When an entropy associated to a discretization of width $\Delta\eta$ is compared with its continuous version, we realize
that we gain $N$ times the 'information' leaked by mistaking a single element of unit length for an interval of
size $\Delta\eta$, which is $N\left[-\log_2(\Delta\eta) + \log_2(1)\right] = -N \log_2(\Delta\eta)$. In terms of entropy per component, (4.11) becomes
$h = h_{\mathrm{cont}} - \log_2(\Delta\eta)$, which generalizes (3.5), as now $h_{\mathrm{cont}}$ has the same expression as in (3.5) but changing
$\sigma$ by $\sigma_e$. Furthermore, there is a critical $\Delta\eta$-value for which the whole $h$ vanishes. When this happens, the
discretization is so coarse that the little resolution kept is not enough to store any effective information at all.
Another convenient way of writing the entropy per component is

$$ h = \log_2\left(\sqrt{2\pi e}\,\frac{\sigma_e}{\Delta\eta}\right) + \text{exponentially small terms}, \tag{4.12} $$



where we have now that the effective variance is:

$$ \sigma_e^2 \equiv \mathrm{Det}^{1/N}(C). \tag{4.13} $$

These expressions generalize to correlated variables the result in eq. (4.5) for $h_0$ by just replacing $C_0$ with $C$ and
$\sigma_0$ with $\sigma_e$. Thus, for a general covariance matrix $C$ we only need to find $\sigma_e$ above to obtain the corresponding
entropy.



4.2.1 Calculation of Det(C) 

The next task is the calculation of the determinant of $C$. Going to Fourier space (see sec. B, subsec. B.2) one
obtains the relation

$$ \mathrm{Det}(C) = \left(\frac{\Delta\omega}{2\pi\,\Delta t}\right)^{N} \mathrm{Det}(\hat{C}), \tag{4.14} $$

where $\hat{C}$ is the Fourier-space representation of $C$, $\Delta t$ is the time sampling interval, $\Delta\omega$ the associated
frequency interval and, taking into account that $N$ samples are considered, $\Delta\omega = 2\pi/(N \Delta t)$. In order to find concrete
results, some sort of hypothesis on $C$ has to be made. Here we consider stationary (or homogeneous) processes,
for which the covariance matrix is a Toeplitz matrix, and therefore $\hat{C}$ is diagonal (see subsec. B.3), so that
$\langle \hat\eta(\omega)\, \hat\eta^*(\omega') \rangle = P(\omega)\, \delta_{\mathrm{Dirac}}(\omega - \omega')$, whose discrete version yields:

$$ \hat{C}_{jk} = \frac{P(\omega_j)}{\Delta\omega}\, \delta_{jk}, \tag{4.15} $$

i.e., $\hat{C}$ is a diagonal matrix. In all these cases, the problem boils down to the properties of the $P(\omega)$ function.
If we denote by $P$ the diagonal matrix

$$ P \equiv \mathrm{diag}\left(P(\omega_{-N/2}), \ldots, P(\omega_{N/2})\right), \tag{4.16} $$

we can write the effective rms correlation $\sigma_e$ that appears in (4.12) as:

$$ \sigma_e^2 = \frac{1}{2\pi\,\Delta t}\, \mathrm{Det}^{1/N}(P) = \frac{1}{2\pi\,\Delta t} \left[\prod_j P(\omega_j)\right]^{1/N}. \tag{4.17} $$
The white noise case corresponds to the constant power spectrum $P(\omega) = A$, and the matrix $P$ is proportional to
the identity. In this case,

$$ \sigma_e^2 = \sigma_0^2 = \frac{A}{2\pi\,\Delta t}, \tag{4.18} $$

showing that the larger the sampling interval $\Delta t$ the smaller the variance, as expected.

We can also express the entropy as a difference from the entropy $h_0$ of a white noise spectrum of amplitude
$P = A$ by:

$$ h = h_0 + \frac{1}{2N} \log_2\left[\mathrm{Det}\left(\frac{1}{A}\, P\right)\right]. \tag{4.19} $$



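In general, given two power spectra $P_1$ and $P_2$ with effective correlations $\sigma_{e1}$ and $\sigma_{e2}$, the entropy differences
are given by:

$$ h_2 - h_1 = \log_2\left(\frac{\sigma_{e2}}{\sigma_{e1}}\right) = \log_2\left[\frac{\mathrm{Det}^{1/2N}(P_2)}{\mathrm{Det}^{1/2N}(P_1)}\right] = \frac{1}{2N} \log_2\left[\mathrm{Det}\left(P_2 P_1^{-1}\right)\right]. \tag{4.20} $$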



Entropy comparison for equal-$\sigma_{1p}$ processes. From the expression above (4.17) it is clear that $\sigma_e^2$ is linearly
proportional to the amplitude of the power spectrum $P(\omega)$, so that $h$ will depend (logarithmically) on the
normalization of $P(\omega)$. It is interesting to compare the entropy for different shapes of $P(\omega)$ which have been
normalized in the same way. Here we will consider the case where we normalize $P(\omega)$ so that $\vec\eta$ has the same
1-point variance. We will see that this is equivalent to fixing the trace of the $P$ matrix (4.16).

First, using eq. (B.13) and the properties of the trace, we get $\mathrm{Tr}(C) = \frac{\Delta\omega}{2\pi\,\Delta t}\, \mathrm{Tr}(\hat{C})$. Combining this
and (4.14),

$$ \frac{\mathrm{Det}(C)}{\left[\mathrm{Tr}(C)\right]^{N}} = \frac{\mathrm{Det}(\hat{C})}{\left[\mathrm{Tr}(\hat{C})\right]^{N}}. \tag{4.21} $$



Now, bearing in mind the usual definition of the 1-point variance $\sigma_{1p}^2$, which reads $\sigma_{1p}^2 = C(\eta(t), \eta(t))$, let's
introduce

$$ \sigma_{1p}^2(C) \equiv \frac{1}{N}\, \mathrm{Tr}(C) = \frac{1}{2\pi\,\Delta t\, N}\, \mathrm{Tr}(P). \tag{4.22} $$

For the case of uncorrelated variables (white noise) with equal sigma, $\sigma_1 = \cdots = \sigma_N = \sigma_0$, we have that
$\sigma_{1p} = \sigma_e = \sigma_0$ in eq. (4.17). In general $\sigma_{1p} \ne \sigma_e$ when there are correlations.



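Using this definition, we have from (4.21):

$$ \sigma_e^2 \equiv \mathrm{Det}^{1/N}(C) = \sigma_{1p}^2\, \frac{\mathrm{Det}^{1/N}(P)}{\mathrm{Tr}(P)/N}. \tag{4.23} $$

Inserting this result into (4.12), we can write

$$ h = h_{1p} + \frac{1}{2N} \log_2\left[\frac{\mathrm{Det}(P)}{\left[\mathrm{Tr}(P)/N\right]^{N}}\right] + \text{exponentially small part}, \tag{4.24} $$

where

$$ h_{1p} \equiv h(\sigma_{1p}) = \log_2\left(\sqrt{2\pi e}\,\frac{\sigma_{1p}}{\Delta\eta}\right). \tag{4.25} $$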
These new formulae are adequate for comparing processes with the same value of $\sigma_{1p}^2$ and different $P$'s (i.e.,
different power spectra). The 1-point entropy $h_{1p}$ denotes the entropy per component of a white noise with a
variance $\sigma_1^2 = \ldots = \sigma_N^2 = \sigma_{1p}^2$, as in this case $P \propto I$, causing the second term on the r.h.s. of (4.24) to vanish,
so that $h = h_{1p}$.



For any square and positive semidefinite matrix $M$, the inequality $\frac{1}{N}\mathrm{Tr}(M) \ge \mathrm{Det}^{1/N}(M)$ holds. Both
$C$ and $P$ satisfy these conditions. Therefore $\sigma_e^2 \le \sigma_{1p}^2$ and $h \le h_{1p}$. The equality is achieved when $P \propto I$, i.e.,
only for the white noise itself. In any other case, a Gaussian process with the same $\sigma_{1p}$ has smaller effective
variance and lower entropy than the corresponding white noise. This is easy to understand from (2.4) or (2.5).
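The inequality $h \le h_{1p}$ can be seen numerically from the second term of (4.24), which is the logarithm of a geometric-to-arithmetic mean ratio of the sampled spectrum. The sketch below is illustrative only: the function names, the unit frequency step and the two sample spectra are assumptions of this example.

```python
import math

def entropy_gap(spectrum):
    """h - h_1p = (1/2N) log2[ Det(P) / (Tr(P)/N)^N ] for a diagonal P, as in eq. (4.24)."""
    n = len(spectrum)
    log_det = sum(math.log2(p) for p in spectrum)   # log2 Det(P)
    mean = sum(spectrum) / n                        # Tr(P)/N
    return (log_det - n * math.log2(mean)) / (2 * n)

def sampled_spectrum(power, n_half):
    """P on the modes j = ±1, ..., ±n_half (zero mode omitted), in units where Δω = 1."""
    positive = [power(j) for j in range(1, n_half + 1)]
    return positive + positive                      # uses P(-ω) = P(ω)

white = sampled_spectrum(lambda w: 1.0, 512)
pink = sampled_spectrum(lambda w: 1.0 / w, 512)     # 1/f-type spectrum

print("white noise gap:", entropy_gap(white))
print("1/f noise gap  :", entropy_gap(pink))
```

The gap vanishes for the flat spectrum and is negative for the $1/f$ one, i.e. at equal 1-point variance the correlated noise carries less entropy per component.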

Asymptotic expressions. When the exact form of $\mathrm{Det}(P)$ is not easy to obtain, we can resort to the following
procedure. We may assume that $P(-\omega) = P(\omega)$ and that the mode with $\omega_0 = 0$ has to be removed, as often
happens (this mode is related to the correlation at $t \to \infty$ and, if one requires that the system be ergodic, it
should vanish). Then,

$$ \log_2\left[\mathrm{Det}(P)\right] = \log_2\left[\prod_{j=-N/2}^{N/2} P(\omega_j)\right] = 2 \sum_{j=1}^{N/2} \log_2\left[P(\omega_j)\right], \tag{4.26} $$

and an application of the Euler-Maclaurin summation formula (see e.g. [17]) leads us to the approximation

$$ 2 \sum_{j=1}^{N/2} \log_2\left[P(\omega_j)\right] = 2\left[\frac{1}{\Delta\omega} \int_{\omega_{\min}}^{\omega_{\mathrm{Max}}} d\omega\, \log_2\left[P(\omega)\right] + \frac{1}{2}\left(\log_2\left[P(\omega_{\mathrm{Max}})\right] + \log_2\left[P(\omega_{\min})\right]\right)\right] + \text{higher order terms in } \Delta\omega. \tag{4.27} $$



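The same method can be applied to the calculation of $\mathrm{Tr}(P)$ in (4.24), i.e.

$$ 2 \sum_{j=1}^{N/2} P(\omega_j) = 2\left[\frac{1}{\Delta\omega} \int_{\omega_{\min}}^{\omega_{\mathrm{Max}}} d\omega\, P(\omega) + \frac{1}{2}\left(P(\omega_{\mathrm{Max}}) + P(\omega_{\min})\right)\right] + \text{higher order terms in } \Delta\omega. \tag{4.28} $$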



Filters. Quite often, stochastic processes go through what is called a filter. Formally, filters can be pictured
as multiplicative changes in the power spectrum. Therefore, everything happens as if we had a new power
spectrum function, say $P'$, coming from the replacement

$$ P(\omega) \longrightarrow P'(\omega) = P(\omega)\, \phi(\omega), $$

where the $\phi$ function is the frequency response of the filter itself. Let $h'$ denote the new entropy per component.
It is immediate that the change caused by the introduction of $\phi$ will be given by

$$ h' = h + h_\phi, \qquad h_\phi = \frac{1}{N} \log_2\left[\sqrt{\mathrm{Det}(\Phi)}\right], \qquad \Phi = \mathrm{diag}\left(\phi(\omega_{-N/2}), \ldots, \phi(\omega_{N/2})\right), \tag{4.29} $$

where $h$ denotes the entropy per component for the same process when no filter is present.
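Since $\Phi$ is diagonal, the shift $h_\phi$ of (4.29) reduces to an average of $\log_2 \phi(\omega_j)$ over the sampled modes. The following minimal sketch evaluates it for a made-up low-pass response; the response shape, its cutoff and the frequency grid are assumptions of this example and are not taken from the text.

```python
import math

def filter_entropy_shift(response, omegas):
    """h_phi = (1/N) log2 sqrt(Det Phi) = (1/2N) * sum_j log2 phi(omega_j), eq. (4.29)."""
    return sum(math.log2(response(w)) for w in omegas) / (2 * len(omegas))

def lowpass(w, cutoff=0.25):
    """Hypothetical frequency response, used only for illustration."""
    return 1.0 / (1.0 + (w / cutoff) ** 2)

# Modes j = ±1, ..., ±n_half with omega_Max = pi (unit sampling interval assumed).
n_half = 256
omegas = [j * math.pi / n_half for j in range(-n_half, n_half + 1) if j != 0]

print("entropy shift per component (bits):", filter_entropy_shift(lowpass, omegas))
```

A response below one at every frequency lowers the entropy per component, while a flat gain of 4 raises it by exactly one bit.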

4.2.2 Simple power-law power spectrum 

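Here, we will consider a power spectrum of the type

$$ P(\omega) = A \left(\frac{|\omega|}{\omega_0}\right)^{n_p}, \tag{4.30} $$

where $A$ is a constant that sets the overall amplitude and $\omega_0$ some characteristic scale that sets the time units.
Taking into account the discrete $\omega$-values (B.10) we evaluate

$$ \mathrm{Det}(P) = A^{N} \left(\frac{\Delta\omega}{\omega_0}\right)^{N n_p} \left[\left(\frac{N}{2}\right)!\right]^{2 n_p} \tag{4.31} $$

(where the zero mode $j = 0$ has been omitted). Making use of Stirling's approximation for large $N/2$, and using
the frequency relations (B.11), we find:

$$ \sigma_e^2 = \mathrm{Det}^{1/N}(C) \simeq \frac{A}{2\pi\,\Delta t} \left(\frac{\pi}{e\,\omega_0\,\Delta t}\right)^{n_p} = \sigma_0^2 \left(\frac{\omega_{\mathrm{Max}}}{e\,\omega_0}\right)^{n_p}, \tag{4.32} $$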



where $\sigma_0$ corresponds to the white noise case ($n_p = 0$). If we normalize the spectrum at $\omega_0 = \omega_{\mathrm{Max}}$ then for
$n_p < 0$ we have that $h > h_0$ and the optimal compression rate has to decrease, while for $n_p > 0$ we have $h < h_0$.
Some special values are given in Table 2, and are also illustrated by Fig. 1. However, this comparison depends
on the normalization and involves noises with different values of $\sigma_{1p}$, as we have only changed the value of $n_p$
without doing anything to maintain the initial $\sigma_{1p}$. In this case, by eq. (4.22),

$$ \sigma_{1p}^2 = \frac{1}{2\pi\,\Delta t}\, \frac{1}{N}\, \mathrm{Tr}(P) = \frac{A}{\pi N \Delta t} \left(\frac{\Delta\omega}{\omega_0}\right)^{n_p} S_{n_p}, \qquad S_{n_p} \equiv \sum_{j=1}^{N/2} j^{\,n_p}. \tag{4.33} $$



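Making use of (4.24), we are led to

$$ h = h_{1p} + \frac{1}{2N} \log_2\left\{\frac{\left[(N/2)!\right]^{2 n_p}}{\left[\frac{2}{N}\, S_{n_p}\right]^{N}}\right\} \simeq h_{1p} - \frac{1}{2} \log_2\left[\frac{S_{n_p}}{(N/2)^{n_p + 1}}\right] - \frac{n_p}{2} \log_2(e) + O\left(\frac{\log_2(N)}{N}\right), \tag{4.34} $$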



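where the Stirling approximation has been applied. When $n_p > -1$, we apply the Euler-Maclaurin summation
formula (4.28) and obtain

$$ \sigma_{1p}^2 \simeq \frac{A}{2\pi\,\Delta t} \cdot \frac{1}{n_p + 1} \left(\frac{\pi}{\omega_0\,\Delta t}\right)^{n_p} \quad \text{for } n_p > -1. \tag{4.35} $$

For the $n_p = -1$ case, $S_{-1}$ may be more straightforwardly estimated by using

$$ S_{-1} = \sum_{j=1}^{N/2} \frac{1}{j} \simeq \ln\left(\frac{N}{2}\right) + \gamma, \tag{4.36} $$

where $\gamma$ is Euler's constant: $\gamma \simeq 0.57721\ldots$. So, $\sigma_{1p}^2$ becomes

$$ \sigma_{1p}^2 \simeq \frac{A\,\omega_0}{2\pi^2} \left[\ln\left(\frac{N}{2}\right) + \gamma\right] \quad \text{for } n_p = -1. \tag{4.37} $$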



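Then, by the previous formulas and by (4.24),

$$ h \simeq \begin{cases} h_{1p} - \dfrac{n_p}{2} \log_2(e) + \dfrac{1}{2} \log_2(n_p + 1) + O\left(\dfrac{\log_2(N)}{N}\right) & \text{for } n_p > -1, \\[2ex] h_{1p} + \dfrac{1}{2} \log_2(e) - \dfrac{1}{2} \log_2\left[\ln\left(\dfrac{N}{2}\right) + \gamma\right] + O\left(\dfrac{\log_2(N)}{N}\right) & \text{for } n_p = -1, \end{cases} \tag{4.38} $$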
where $h_{1p}$, given by (4.25), is the entropy per component of a white noise with $\sigma_0 = \sigma_{1p}$. Note that, although
it seems that $h$ diverges with $N$ for $n_p = -1$, this is an artifact of this type of comparison with a fixed $\sigma_{1p}$.
Although $\sigma_{1p}^2$ diverges logarithmically with $N$, the information content does not, as $\sigma_e^2$ in eq. (4.32) is finite:

$$ \sigma_e^2 \simeq \frac{e\,A\,\omega_0}{2\pi^2} \quad \text{for } n_p = -1. \tag{4.39} $$

Some examples are illustrated by the 5th column of Table 2 and by the figure below.




Figure 3: Entropy and optimal compression rate for different power spectra with the same $\sigma_{1p}$, but (unlike in
Fig. 2) keeping $l_i = 16$ bits fixed and $\omega_0 = \omega_{\min}$. The present set of cases is: $n_p = 0$ (solid line), $n_p = -1$
(dashed line) and $n_p = +1$ (dotted line).

Fig. 4 shows the entropy $h$ as a function of the spectral index $n_p$ given by the above formulas. As can
be seen, $h$ has a maximum at $n_p = 0$, as expected.
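The asymptotic forms (4.38) can be compared with a direct evaluation of the second term of (4.24) for the power-law spectrum (4.30), in which $A$ and $\Delta\omega/\omega_0$ cancel. The sketch below is a numerical check under that assumption; the function names and the choice $N = 65536$ are made up for this example, and `math.lgamma` is used to obtain $\ln[(N/2)!]$.

```python
import math

EULER_GAMMA = 0.5772156649015329

def power_law_gap(n_p, n):
    """Exact h - h_1p from (4.24) for P ∝ |ω|^{n_p} on N = n modes (n even, zero mode omitted)."""
    half = n // 2
    s = sum(j ** n_p for j in range(1, half + 1))          # S_{n_p} of eq. (4.33)
    log2_factorial = math.lgamma(half + 1) / math.log(2)   # log2[(N/2)!]
    return n_p * log2_factorial / n - 0.5 * math.log2(2 * s / n)

def asymptotic_gap(n_p, n):
    """Large-N forms of eq. (4.38), without the O(log2(N)/N) remainder."""
    if n_p == -1:
        return 0.5 * math.log2(math.e) - 0.5 * math.log2(math.log(n / 2) + EULER_GAMMA)
    return -0.5 * n_p * math.log2(math.e) + 0.5 * math.log2(n_p + 1)

n = 65536
for n_p in (-1, -0.5, 0, 0.5, 1, 2):
    print(n_p, round(power_law_gap(n_p, n), 4), round(asymptotic_gap(n_p, n), 4))
```

The gap is zero at $n_p = 0$ and negative on both sides, consistent with the maximum of $h$ at $n_p = 0$ noted above; for indices close to $-1$ the asymptotic form converges more slowly with $N$.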



4.2.3 '$f^0 + 1/f$' spectrum

In practice, realistic power spectra often include combinations of several powers. This new example corresponds
to a power spectrum including two terms: one with $n_p = 0$ (white noise) and another with $n_p = -1$ (usually
called $1/f$ noise), which we write as

$$P(\omega) = A\left(1 + \frac{\omega_k}{|\omega|}\right) = A\left(1 + \frac{f_k}{|f|}\right), \qquad (4.40)$$



where $f$ stands for frequency, $\omega = 2\pi f$, and $f_k$ for the so-called knee frequency, where both contributions are
equal. We shall assume that $\omega$ has been discretized as in the previous cases. Because a direct evaluation of
Det(P) would not be so easy now, we shall apply the approximation discussed above, based on the Euler-Maclaurin
summation formula. After performing the integration (4.27) for the $P(\omega)$ of eq. (4.40) one gets



A 
2nAt 



u '/ u Max 



(4.41) 
Figure 4: Entropy $h$ (continuous line) and optimal compression $c_{r,{\rm opt}}$ (dashed line) for $l_i = 8$ bits as a function of
the spectral index $n_p$ (for a power law $P(\omega) \propto \omega^{n_p}$) with a fixed one-point variance $\sigma_{1p}^2$ and $\Delta = \Delta\eta/\sigma_{1p} = 0.25$.



The corresponding entropy is just given by eq. (4.12). When $\omega_k \ll \omega_{\rm Max}$ we recover the white noise case,
eq. (4.18), while in the case $\omega_k \gg \omega_{\rm Max}$ the $1/f$ noise dominates and we recover eq. (4.39), as expected. We
observe that a combined power spectrum (4.40) with reasonably small $A$ is effectively equivalent to one of the
type $P(\omega) = A\,(|\omega|/\omega_0)^{n_p}$ with an intermediate $n_p$ between 0 and $-1$. An illustration of the values of $h$ and
optimal compression for this case is shown in Fig. H and also in Fig. \a as dashed line.

Typically we will have that $\omega_{\min} \ll \omega_{\rm Max}$ and also $\omega_{\min} \ll \omega_k$. In this case the only relevant parameter
is $r = \omega_k/\omega_{\rm Max}$:



$$\sigma_e^2 \simeq \frac{A}{2\pi\Delta t}\,\frac{(1+r)^{1+r}}{r^r} = \sigma_0^2\,\frac{(1+r)^{1+r}}{r^r}, \qquad (4.42)$$

where $r = 0$ reproduces the white noise case and large $r$ reproduces the $1/f$ case ($n_p = -1$) with arbitrarily
large normalization. For $r = 1$ we have that the effective variance of the signal is four times as large as the white
noise part, $\sigma_e^2 = 4\sigma_0^2$, so that the entropy will be one unit larger with the combined spectrum than with the
white noise alone. Other values for $h$ and $c_r$ as a function of $r$ are shown in Fig. 5. In this case $\Delta = \Delta\eta/\sigma_0 = 1$,
so that $h_0 \simeq 2.047$ and $c_{r,{\rm opt}} \simeq 3.91$ ($l_i = 8$ bits), which agrees with the values at $r = 0$.
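The numbers quoted in this paragraph can be reproduced with a few lines of Python. The sketch below takes $\sigma_e^2/\sigma_0^2 = (1+r)^{1+r}/r^r$ and $h = \log_2(\sqrt{2\pi e}\,\sigma_e/\Delta\eta)$ as given; the function names and the default $l_i = 8$ bits are illustrative assumptions.

```python
import math


def sigma_e2_ratio(r):
    """sigma_e^2 / sigma_0^2 = (1 + r)**(1 + r) / r**r, with r = w_k / w_Max."""
    if r == 0:
        return 1.0  # pure white noise
    return math.exp((1 + r) * math.log1p(r) - r * math.log(r))


def entropy_mixed(r, delta):
    """Entropy per component in bits, with delta = d_eta / sigma_0."""
    h0 = math.log2(math.sqrt(2 * math.pi * math.e) / delta)
    return h0 + 0.5 * math.log2(sigma_e2_ratio(r))


def compression_mixed(r, delta, l_i=8):
    """Optimal compression rate l_i / h for symbols of l_i bits."""
    return l_i / entropy_mixed(r, delta)


if __name__ == "__main__":
    for r in (0, 0.1, 1, 10):
        print(r, entropy_mixed(r, 1.0), compression_mixed(r, 1.0))
```

For large $r$ the ratio approaches $e\,r$, which is the $1/f$ behaviour with $\omega_0 = \omega_k$.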

Another way to compare the two cases is to use an equal-$\sigma_{1p}$ comparison with a white noise. In this case:

$$\sigma_{1p}^2 = \frac{1}{N}\,{\rm Tr}(C) = \frac{A}{2\pi\Delta t}\left[1 + \frac{\omega_k}{\omega_{\rm Max}}\,S_{-1}(N/2)\right], \qquad (4.43)$$

and, using (4.24),

$$h = h_{1p} - \frac{1}{2}\log_2\!\left[\frac{\omega_{\rm Max} + \omega_k\,S_{-1}(N/2)}{\omega_{\rm Max}}\right] + \frac{1}{2}\log_2\!\left[\frac{(1+r)^{1+r}}{r^r}\right] + O\!\left(\frac{\log_2 N}{N}\right), \qquad (4.44)$$

where $h_{1p}$ stands for the entropy per component of a Gaussian white noise with the same 1-point variance $\sigma_{1p}^2$.
An example of this type of noise is shown in the 4th column of Table 2 and Fig. Of course, the $\omega_k/\omega_{\rm Max} \to \infty$
limit of this expression yields the $n_p = -1$ case of (4.38) (see also the 5th column of Table 2).

Figure 5: Entropy $h$ and optimal compression $c_{r,{\rm opt}}$ ($l_i = 8$ bits) as a function of $r = \omega_k/\omega_{\rm Max}$ for a '$f^0 + 1/f$'
noise. We have chosen $\Delta = \Delta\eta/\sigma_0 = 1$ and symbols of $l_i = 8$ bits.



                                 $h_0 = h_{1p}$    $h$              $h$                          $h$
$\Delta = \Delta\eta/\sigma_0$   $n_p = 0$         $f^0 + 1/f$      $f^0 + 1/f$                  $n_p = -1$
                                                   $\sigma_0$       $\sigma_{1p} = \sigma_0$     $\sigma_{1p} = \sigma_0$
0.05                             6.37              7.37             5.89                         5.71
0.25                             4.05              5.05             3.57                         3.39
0.50                             3.05              4.05             2.57                         2.39
1.00                             2.05              3.05             1.57                         1.39

Table 2. Shannon entropy per component $h$ for large $N$, and several values of $\Delta = \Delta\eta/\sigma_0$. The purely white-noise values
$h_0$ for a given $\sigma_0$ and $\Delta$ are listed in column 2. Columns 3 and 4 give the results for a combination $P(\omega) = A(1 + \omega_k/|\omega|)$,
with $\omega_k = \omega_{\rm Max}$ ($r = 1$), when the white noise part is fixed to the same $\sigma_0$ (column 3) and when the 1-point sigma is fixed
to $\sigma_{1p} = \sigma_0$ (column 4). In column 5 we have listed the values for a correlation of the $n_p = -1$ type, $P(\omega) = A(\omega_0/|\omega|)$,
with $\sigma_{1p} = \sigma_0$. In the last two cases $N = 1000$.
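The entries of Table 2 can be regenerated from the large-$N$ expressions of this section. The sketch below assumes $h_0 = \log_2(\sqrt{2\pi e}/\Delta)$, $\sigma_e^2 = 4\sigma_0^2$ at $r = 1$, and $S_{-1}(N/2) \simeq \ln(N/2) + \gamma$; the function name and the returned tuple layout are hypothetical and chosen only for this illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def table2_row(delta, N=1000):
    """Return (h0, col3, col4, col5) of Table 2 for a given delta = d_eta / sigma_0."""
    h0 = math.log2(math.sqrt(2 * math.pi * math.e) / delta)
    s_minus1 = math.log(N / 2) + EULER_GAMMA      # harmonic sum S_{-1}(N/2), large N
    col3 = h0 + 1.0                               # r = 1, white part fixed to sigma_0
    col4 = h0 + 1.0 - 0.5 * math.log2(1.0 + s_minus1)   # r = 1, sigma_1p = sigma_0
    col5 = h0 + 0.5 * math.log2(math.e) - 0.5 * math.log2(s_minus1)  # n_p = -1
    return h0, col3, col4, col5


if __name__ == "__main__":
    for delta in (0.05, 0.25, 0.50, 1.00):
        print(delta, ["%.2f" % value for value in table2_row(delta)])
```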

We can see there how $h < h_{1p}$ when we compare spectra normalized to have the same $\sigma_{1p}^2$, while $h > h_{1p}$
when we just add a term ($1/f$) to the (constant) white noise power spectrum. The interpretation is simple:
as shown in eq. (4.12), the entropy is given by the effective correlation. On the one hand, adding power always
increases $\sigma_e$ (see eq. (4.17)), and therefore $h$. But, on the other hand, $\sigma_e^2 \le \sigma_{1p}^2$, so that, when $\sigma_{1p}$ is fixed, any
power spectrum gives smaller $h$ than the white noise and, as we said above, this can be easily understood in
the light of inequality (2.2). This change of behaviour can be seen comparing Figs. [I] and | with Fig.i

Figure 6: Comparison between purely white noise (solid line) and two processes of the type $P(f) \propto 1 + f_k/|f|$
with $f_k = 10$ Hz (dashed line) and 100 Hz (dotted line), for $l_i = 16$ bits, and without imposing the equal-$\sigma_{1p}$
constraint. In both cases $h > h_0$, while in the analogous example of Fig. 3 just the opposite happened.



4.2.4 Examples of piecewise-mixed spectra 

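1. Here we study the piecewise-defined spectrum:

$$P(\omega) = \begin{cases} A, & \text{for } \omega < \omega_L, \\[1ex] A\,\dfrac{\omega_L}{\omega}, & \text{for } \omega_L < \omega < \omega_H, \\[2ex] A\,\dfrac{\omega_L\,\omega_H}{\omega^2}, & \text{for } \omega_H < \omega < \omega_{\rm Max}. \end{cases} \qquad (4.45)$$

The result of applying (4.19) and making asymptotic approximations for large values of $\omega_L/\Delta\omega$ and $\omega_{\rm Max}/\Delta\omega$ is

$$h \simeq h_0 + \left(1 - \frac{\omega_L + \omega_H}{2\,\omega_{\rm Max}}\right)\log_2(e) - \frac{1}{2}\log_2\!\left(\frac{\omega_{\rm Max}^2}{\omega_L\,\omega_H}\right) + \text{higher order terms}. \qquad (4.46)$$

2. Another case which can be of interest is

$$P(\omega) = \begin{cases} A', & \text{for } \omega < \omega_L, \\[1ex] A + \dfrac{B}{|\omega|}, & \text{for } \omega_L < \omega < \omega_{\rm Max}. \end{cases} \qquad (4.47)$$

Taking now as reference the case in which $B = 0$ and $A' = A$, we may write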

"77" 



h = h(A' = A, B = 0)+ log 



1 + 



Aui 



^Ma 



■log 2 



B 

Alot 



B , 

+ 7 lo g2 



Au)t, + B 



WMa> 



log 2 



(4.48) 



+ higher order terms.

5 Conclusions



We have studied the Shannon entropy $h$ of a Gaussian discrete noise $\eta_i$ characterized by its power spectrum
$P$. It amounts to $h \simeq \log_2\!\left(\sqrt{2\pi e}\,\sigma_e/\Delta\eta\right)$, where $\sigma_e = \sigma_e(P)$ is given by eq. (4.17) and $\Delta\eta$ is the discretization
width. The finite-$N$ corrections to this formula are exponentially small (eqs. (A.6) and (A.14) in Appendix A).
The first thing to notice is that $\sigma_e^2$ changes linearly with the amplitude of $P$, so that the entropy increases
logarithmically with $P$. For a given normalization, how does the entropy depend on the shape of the power
spectrum? We can compare the entropy of two types of noise using the entropy difference $\Delta h = h - h_0$. In
cases with power-law spectra $P(\omega) \propto (\omega/\omega_0)^{n_p}$, $\Delta h$ can be quite sensitive to the choice of $\omega_0$, whose variations
may even cause a reversal of the sign of $\Delta h$. This type of change is due to the logarithmic
dependence of $h$ on the amplitude of $P$ noted above. If we fix the (1-point) variance of the noise, we have seen that the
maximum entropy (minimum compression) is the one given by white noise (or constant $P$), as expected. For
$P(\omega) \propto \omega^{n_p}$ spectra with fixed one-point variance, we have that the larger $|n_p|$ the smaller the entropy for
$n_p > -1$ (see eq. (4.38) and Fig. 2). Notice that when $\Delta\eta > \sqrt{2\pi e}\,\sigma_e \simeq 4.13\,\sigma_e$ we have $h < 0$, indicating that the data
have been discretized with such a low resolution that there is no information left.

We have defined the optimal compression rate as the ratio of the initial average length per code unit $l_i$ over
the Shannon entropy $h$ per component: $c_{r,{\rm opt}} = l_i/h$. For a linearly discretized data set with $l_i = N_{\rm bits} = \log_2(N_s)$
bits the optimal compression rate depends on the discretization width $\Delta\eta$ through a simple relation:

$$c_{r,{\rm opt}} = \frac{l_i}{h} = \frac{N_{\rm bits}}{\log_2\!\left(\sqrt{2\pi e}\,\sigma_e/\Delta\eta\right)}. \qquad (5.1)$$

The choice of $\Delta\eta$ is in principle arbitrary and depends on what we want to do in the data processing of the
signal (noise). The final compression factors will depend only on the ratio of these two quantities, $\Delta = \Delta\eta/\sigma_e$,
and the number of bits $N_{\rm bits}$ chosen to represent the data. Another way of writing this result is:

$$c_{r,{\rm opt}} = \frac{\log_2(R) - \log_2(\Delta\eta)}{h_{\rm cont} - \log_2(\Delta\eta)},$$

where $R$ is the range of the random variable and $h_{\rm cont}$ is a constant depending on the
type of process, which may be interpreted as the Shannon entropy per component in the continuum limit. In
mathematical terms, $h_{\rm cont}$ involves the determinant of the correlation matrix. If the initial length $l_i$ is held
fixed, independently of $\Delta\eta$, the relation is just $c_{r,{\rm opt}}(\Delta\eta) \simeq \dfrac{l_i}{h_{\rm cont} - \log_2(\Delta\eta)}$.

The purely white noise case ($n_p = 0$) offers rather slight hopes for moderate ranges $R$. If we choose
$R = (-N_0\sigma, N_0\sigma)$ with $N_0 = 3$, and $\Delta = \Delta\eta/\sigma = 0.25$, the compression rate is $c_{r,{\rm opt}} = 1.13$, only marginally
above one, and, yet, this happens at the expense of losing resolution to the extent that only four distinct values
are observed within each interval of width $\sigma$. Less resolution than that may be too little for many applications.
One could wonder what happens, in the opposite case, when resolution is kept at any cost. For a binning of $2^8$
distinct intervals within the same range, $\Delta$ has to take on such a value that the compression rate is a meagre
1.07. Such a thinly spaced binning means that the white noise is seen very much like a uniformly distributed
one, and has a similar incompressibility.
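The two white-noise figures quoted above (1.13 and 1.07) follow directly from $c_{r,{\rm opt}} = \log_2(N_s)/\log_2(\sqrt{2\pi e}/\Delta)$ with $\Delta = R/(N_s\,\sigma)$. A minimal Python sketch, in which the function name and the default range of $6\sigma$ (i.e. $N_0 = 3$) are illustrative assumptions:

```python
import math


def c_opt_white(n_levels, range_in_sigma=6.0):
    """Optimal compression rate for white Gaussian noise quantized into n_levels
    equal bins spanning range_in_sigma standard deviations."""
    delta = range_in_sigma / n_levels           # bin width in units of sigma
    l_i = math.log2(n_levels)                   # initial bits per sample
    h = math.log2(math.sqrt(2 * math.pi * math.e) / delta)  # entropy per sample
    return l_i / h


if __name__ == "__main__":
    print(c_opt_white(24))    # delta = 0.25
    print(c_opt_white(256))   # 2**8 bins over the same range
```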

On the other hand, for fixed $\sigma_{1p}^2$ a negative spectral index lowers the effective information and helps
compression. Moreover, the optimal compression rate increases as the sampling time interval decreases. As we
see in Fig. 2, when $\Delta = 0.25$ the compression rate for $n_p = -1$ with the same $\sigma_{1p}$ as for the white noise
is $\simeq 1.4$. Moreover, the difference between $n_p = -1$ and $n_p = 0$ increases as the discretization parameter
$\Delta = \Delta\eta/\sigma$ grows. However, one cannot think of arbitrarily raising its value, as such a thing would imply a
widening of the discretization error, and an even greater loss in resolution for the values of our variables.

A combination of both types has also been studied by taking a 'mixed' power spectrum with $n_p = 0$
plus $1/f$ (i.e. $n_p = -1$) terms. If the coefficient of the $n_p = 0$ part is low enough, the behaviour shown is
intermediate between purely $n_p = 0$ and purely $n_p = -1$, and can be interpreted as if it just had an effective
$n_p$ between both values. When $P(\omega) \propto \left(A_0 + \omega_0/|\omega|\right)$, if $A_0$ is set to 1, $h$ is not too sensitive to increases in $\omega_0$
much above the knee frequency. On the contrary, if $\omega_0$ is kept constant, variations in $A_0$ may easily change the
sign of $\Delta h$. As a common feature to all possible situations, one observes an increase in compressibility as the
measured data involve more and more correlation, i.e. larger dominance of their spectral $f^{n_p}$-parts with $n_p \neq 0$
(see Fig. |).

Imagine a situation of a data set that consists of a slowly varying signal (to be stored in $l_i$ bits) plus
large amplitude noise that dominates over the signal at high frequencies. The signal is to be recovered by
averaging the noise after transmission (and therefore compression) and a careful calibration of instabilities in
the noise. This is a common situation for scientific measurements on-board satellites collecting data with low
signal-to-noise ratio. In this case the noise component can be kept with a low resolution and one can choose
$\Delta\eta \simeq \sigma_e$, which gives $h \simeq 2.05$, indicating that all information is contained effectively in two bits. Then, high
compression rates $c_{r,{\rm opt}} \simeq l_i/2$ could be obtained: e.g. $c_{r,{\rm opt}} \simeq 8$ for $l_i \simeq 16$ bits. To achieve such high
compression values in practice, an efficient coding method has to be used. For one dimension, the Huffman and
arithmetic schemes are known to be reasonably close to the optimal value. When data (symbols) are correlated 
in a manifest way, as the general case considered here, other methods have to be used in combination. One of 
the simplest methods that take into account correlations is run-length encoding, where the signal is converted 
to a stream of integers that indicate how many consecutive symbols are equal (see flj^| ). This would be quite 
efficient in the situation we have just mentioned. 
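As a minimal illustration of the run-length idea just described, the Python sketch below converts a symbol stream into (symbol, run length) pairs and back. It is only a toy example with hypothetical function names, not the coding scheme of any particular instrument.

```python
from itertools import groupby


def run_length_encode(symbols):
    """Collapse consecutive equal symbols into (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]


def run_length_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original sequence."""
    return [symbol for symbol, count in pairs for _ in range(count)]


if __name__ == "__main__":
    data = [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
    encoded = run_length_encode(data)
    print(encoded)
    print(run_length_decode(encoded) == data)
```

The gain is largest when the quantization is coarse and the data are strongly correlated, so that long runs of identical symbols are frequent.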

The data discretization or 'quantization' process causes a distortion error. This issue has not been 
considered in the present paper, as we have kept it outside of the scope of this study (i.e., we have started from 
a data set already quantized in a given way). Nevertheless, the results in ref. ]2(J (Chap. 13) for a univariate 
Gaussian source indicate that the 'best expected' average error for a representation of a given length $l_i$ decreases
as $l_i$ increases. This confirms the intuitive idea that a random variable like $\eta$ is better described as $l_i$ grows.
However, when this happens the entropy grows too, and the compression chances are reduced.

A Appendix: discrete calculations

A.1 One-dimensional case

First, we rewrite the $Z$ of (3.2) as

$$Z = \sum_{n=-\infty}^{\infty} e^{-\frac{\Delta^2}{2}n^2} = \theta\!\left(\frac{\Delta^2}{2\pi};0\right), \qquad (A.1)$$

where

$$\Delta \equiv \frac{\Delta\eta}{\sigma} \qquad (A.2)$$

is the size of the discretization interval in units of $\sigma$, and

$$\theta(\beta;m) \equiv \sum_{n=-\infty}^{\infty} n^{2m}\,e^{-\pi\beta n^2}$$

is a notation for the sort of Jacobi elliptic theta functions appearing in this calculation.

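Note that the discretization has enabled us to deal with a discrete probability set, (3.1), thus avoiding
the well-known difficulties associated with $H$ for continuous probability distributions. In our own case (calling
$H = H_1$ all through this subsection),

$$H = -\frac{1}{\ln(2)}\left[-\frac{\Delta^2}{2}\,\frac{\theta\!\left(\frac{\Delta^2}{2\pi};1\right)}{\theta\!\left(\frac{\Delta^2}{2\pi};0\right)} - \ln\theta\!\left(\frac{\Delta^2}{2\pi};0\right)\right].$$

For $m = 1$, we just observe that

$$\theta(\beta;1) = -\frac{1}{\pi}\frac{\partial}{\partial\beta}\theta(\beta;0).$$

Using this, we arrive at

$$H = -\frac{1}{\ln(2)}\left\{\beta\frac{\partial}{\partial\beta}\ln[\theta(\beta;0)] - \ln[\theta(\beta;0)]\right\}, \quad \text{with } \beta = \frac{\Delta^2}{2\pi} \equiv \frac{1}{T}. \qquad (A.3)$$

By (A.1), this can also be written as

$$H = \frac{1}{\ln(2)}\frac{\partial}{\partial T}\left[T\ln(Z)\right]. \qquad (A.4)$$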

Up to the trivial change of units (or, equivalently, a conventional modification of the Boltzmann constant),
$H$ is the thermodynamical entropy $S$ of a one-particle system at temperature $T$ with partition function $Z$. In
the situation we are studying, this $Z$ is $Z(T) = \theta\!\left(\frac{1}{T};0\right)$, as given by eq. (3.2). However, the validity of eq.
(A.4) is quite general: in fact, for any system with probabilities of the form

$$p_J = \frac{e^{-E_J/T}}{Z}, \quad \text{where } Z = \sum_I e^{-E_I/T}$$

(where $I$, $J$ can be single indices or multiple indices), one may check that, after applying the definition (2.1) or
(2.3), eq. (A.4) holds. Therefore, we might as well have started our calculation of $H$ from eq. (A.4) itself (and
we will do so for the $N$-dimensional case). Analogously, $-T\ln(Z)$ plays the role of the Helmholtz free energy
$F$, satisfying the relation $S = -\dfrac{\partial F}{\partial T}$.

A 'finely' or thinly spaced discretization means that $\Delta$ should be small. However, the above expression
of $\theta(\beta;m)$ as a series is obviously inadequate when $\beta = \frac{\Delta^2}{2\pi} \ll 1$. Such a difficulty will be overcome by recalling
the remarkable theta function identity (see e.g. ref. [fl6|)

$$\theta(\beta;0) = \frac{1}{\sqrt{\beta}}\,\theta\!\left(\frac{1}{\beta};0\right). \qquad (A.5)$$



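Applying now this identity to (A.3) or (A.4), expanding each part for small $\Delta$ and differentiating, one finds

$$H = \frac{1}{\ln(2)}\left\{\frac{1}{2}\left[1 - \ln\!\left(\frac{\Delta^2}{2\pi}\right)\right] + 2\left(1 - \frac{2\pi^2}{\Delta^2}\right)e^{-2\pi^2/\Delta^2} + O\!\left(\frac{2\pi}{\Delta^2}\,e^{-4\pi^2/\Delta^2}\right)\right\}, \qquad (A.6)$$

which, regarded as an expansion, is quickly convergent for $0 < \Delta \ll \sqrt{2\pi}$. (One should notice that, actually, the
two expressions have a generous overlap around $\beta \sim 1$ where both converge and either of them can be consistently
used.)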

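There is no explicit dependence on $\sigma$, as the only relevant variable is the relative discretization size $\Delta$.
Even for moderately large values of $\Delta$, the next-to-leading part of $H$ is very small: e.g. for $\Delta = 1$ we have
$\frac{2\pi}{\Delta^2}e^{-2\pi^2/\Delta^2} = 1.7\cdot 10^{-8}$ and $\frac{2\pi}{\Delta^2}e^{-4\pi^2/\Delta^2} = 4.5\cdot 10^{-17}$; at $\Delta = 1/2$ these two quantities become $1.3\cdot 10^{-33}$
and $6.6\cdot 10^{-68}$, respectively. Neglecting such terms, we easily obtain a good approximate formula, which may
be reexpressed as

$$H \simeq \frac{1}{2}\log_2\!\left(\frac{2\pi e}{\Delta^2}\right) = \log_2\!\left(\frac{\sqrt{2\pi e}}{\Delta}\right), \qquad (A.7)$$

with $\Delta$ given by (A.2). This yields eq. (3.3).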
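The smallness of the neglected terms can be checked by brute force. The sketch below sums the discrete weights $p_n \propto e^{-n^2\Delta^2/2}$ directly and compares the resulting entropy with the leading formula $\log_2(\sqrt{2\pi e}/\Delta)$. The function names and the truncation of the sum at about 40 standard deviations are assumptions made for this illustration.

```python
import math


def discrete_gaussian_entropy(delta, n_max=None):
    """Entropy in bits of p_n proportional to exp(-n**2 * delta**2 / 2), n integer."""
    if n_max is None:
        n_max = int(40 / delta) + 1   # terms further out underflow to zero
    weights = [math.exp(-0.5 * (n * delta) ** 2) for n in range(-n_max, n_max + 1)]
    Z = sum(weights)
    return -sum((w / Z) * math.log2(w / Z) for w in weights if w > 0.0)


def leading_formula(delta):
    """Leading small-delta approximation log2(sqrt(2*pi*e) / delta)."""
    return math.log2(math.sqrt(2 * math.pi * math.e) / delta)


if __name__ == "__main__":
    for delta in (0.5, 1.0, 3.0):
        exact = discrete_gaussian_entropy(delta)
        print(delta, exact, leading_formula(delta), exact - leading_formula(delta))
```

For $\Delta \le 1$ the two values agree to better than $10^{-6}$ bits, while for a coarse binning such as $\Delta = 3$ the leading formula is no longer adequate.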



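A.2 N-dimensional case

After looking at the $Z$ of eq. (4.8), let's introduce, for convenience, the new notations

$$\sigma \equiv \min(\{\sigma_1,\dots,\sigma_N\}), \qquad X^{-1} \equiv \sigma^2 C^{-1}, \qquad \Delta \equiv \frac{\Delta\eta}{\sigma}, \qquad (A.8)$$

which enable us to write

$$Z = \sum_{n_1,\dots,n_N=-\infty}^{\infty} e^{-\frac{\Delta^2}{2}\,(n_1,\dots,n_N)\,X^{-1}\,(n_1,\dots,n_N)^T}. \qquad (A.9)$$

In terms of the multidimensional Jacobi theta function

$$\theta_N(\beta|M) \equiv \sum_{\vec n\in\mathbb{Z}^N} e^{-\pi\beta\,\vec n^T M\,\vec n}, \qquad (A.10)$$

we can put

$$Z = \theta_N(\beta|X^{-1}), \qquad \beta = \frac{\Delta^2}{2\pi} \equiv \frac{1}{T}. \qquad (A.11)$$

By (A.4), the joint Shannon entropy (now $H = H_N$) becomes

$$H = \frac{1}{\ln(2)}\frac{\partial}{\partial T}\left[T\ln\theta_N\!\left(\tfrac{1}{T}\Big|X^{-1}\right)\right]. \qquad (A.12)$$

We are interested in approximations for small $\beta$, but the present expressions are inadequate for this situation.
The way out is to take advantage of a Jacobi identity for multidimensional theta functions, namely,

$$\theta_N(\beta|M) = \frac{1}{[{\rm Det}(M)]^{1/2}\,\beta^{N/2}}\,\theta_N\!\left(\frac{1}{\beta}\Big|M^{-1}\right), \qquad (A.13)$$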

which, unlike the initial expression, may be expanded for small $\beta$. Doing so (and noting that $C$ has to be real
when viewed in configuration space),

$$H = \frac{N}{\ln(2)}\left[\frac{1}{2} + \ln\!\left(\frac{1}{\sqrt{\beta}}\right) + \frac{1}{2N}\ln{\rm Det}(X) + O\!\left(\frac{1}{\beta}\,e^{-\frac{\pi}{\beta}\min(C)/\sigma^2}\right)\right]
 = \frac{1}{2}\log_2{\rm Det}\!\left(\frac{2\pi e}{\Delta\eta^2}\,C\right) + O\!\left(N\,\frac{2\pi\sigma^2}{\Delta\eta^2}\,e^{-\frac{2\pi^2\min(C)}{\Delta\eta^2}}\right), \qquad (A.14)$$

where min(C) means the minimum over the (positive) eigenvalues of the correlation matrix $C$, and where the
relations (A.8) and the definition of $\beta$ in (A.11) have been used. More terms of this expansion can be obtained
explicitly by using eq. (A.6). Notice, however, that each order of (A.6) gives rise here, in principle, to $N$ different
orders (the first $N$ of them corresponding to the sequence of eigenvalues of $C$, increasing in magnitude). The
bottom line gives us (4.9).



B Appendix: useful Fourier-space results 

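B.1 Discrete Fourier transforms

The continuous transforms taken as reference are

$$\tilde\eta(k) = \int dx\,e^{ikx}\,\eta(x), \qquad \eta(x) = \int\frac{dk}{2\pi}\,e^{-ikx}\,\tilde\eta(k), \qquad (B.1)$$

where $k$ and $x$ are a pair of conjugate variables. Discretizing them,

$$k_n = n\Delta k, \qquad x_n = n\Delta x, \qquad (B.2)$$

and calling $\eta_n \equiv \eta(x_n)$, $\tilde\eta_n \equiv \tilde\eta(k_n)$, we construct discrete transforms which, in the continuum limit, reproduce (B.1):

$$\tilde\eta_n = \Delta x\sum_m e^{ik_n x_m}\,\eta_m, \qquad \eta_n = \frac{\Delta k}{2\pi}\sum_m e^{-ik_m x_n}\,\tilde\eta_m. \qquad (B.3)$$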


Taking into account (B.2) and the correct relation between sampling intervals, i.e.

$$\Delta k = \frac{2\pi}{N\Delta x}, \qquad (B.4)$$

one realizes that

$$k_n x_m = k_m x_n = \frac{2\pi}{N}\,mn. \qquad (B.5)$$

Therefore, we can write

$$\tilde\eta_n = \Delta x\,(W\vec\eta)_n, \qquad \eta_n = \frac{\Delta k}{2\pi}\,(W^*\tilde{\vec\eta})_n, \qquad (B.6)$$

where $W$ is the symmetric matrix with coefficients $W_{mn} = e^{i\frac{2\pi}{N}mn}$. After renaming

$$x \to t, \qquad k \to \omega, \qquad (B.7)$$

this yields the expressions ([B. 



B.2 Fourier-space relation involving Det(C)

For convenience, we prefer to handle the Fourier-space representation of $C$, which we shall denote by $\tilde C$,
rather than $C$ itself (we will see that $\tilde C$ is simpler). A vector $\vec\eta$ and its discrete Fourier transform $\tilde{\vec\eta}$ are related
by expressions of the type

$$\tilde{\vec\eta} = \Delta t\,W\vec\eta, \qquad \vec\eta = \frac{\Delta\omega}{2\pi}\,W^*\tilde{\vec\eta}, \qquad (B.8)$$

where $W$ indicates a matrix whose coefficients are given by $W_{mn} = e^{i\frac{2\pi}{N}mn}$ (see subsec. B.1). $\Delta t$ is a $t$-interval
which now has to be interpreted as the time lapse between two successive Fourier 'samplings'. If we imagine that
$\eta_j = \eta(t_j)$, then $t_j - t_{j-1} = \Delta t$, $\forall j$. $\Delta\omega$ is the corresponding interval in 'angular frequency' or conjugate space.
Taking into account the usual relation between the sampling interval and the associated angular frequency (or
conjugate momentum) range that can be correctly sampled in conjugate space, one has the following relation
between $\Delta t$, $\Delta\omega$ and $N$:

$$\Delta\omega = \frac{2\pi}{N\Delta t}. \qquad (B.9)$$

The discrete values of $\omega$ are

$$\omega_j = j\,\Delta\omega, \qquad j = -N/2, \dots, N/2. \qquad (B.10)$$

Let $\omega_{\min}$ and $\omega_{\rm Max}$ denote the minimum and maximum nonzero absolute values of $\omega$. Then,

$$\omega_{\min} = |\omega_1| = \Delta\omega = \frac{2\pi}{N\Delta t} = 2\pi f_{\min}, \qquad \omega_{\rm Max} = \frac{N}{2}\,\Delta\omega = \frac{\pi}{\Delta t} = 2\pi f_{\rm Max}. \qquad (B.11)$$

We have here introduced frequencies ($f$'s) in the way $\omega = 2\pi f$, as usual.

Furthermore, by the form of its coefficients and by eq. (B.8), it is clear that the $W$ matrix satisfies

$$W^T = W, \qquad W^{-1} = \frac{\Delta t\,\Delta\omega}{2\pi}\,W^*, \qquad (B.12)$$

and, consequently, $W^{-1} = \frac{\Delta t\,\Delta\omega}{2\pi}\,W^\dagger$. In other words, up to a multiplicative scalar constant, $W$ is a unitary
operator. Taking now formula (4.6), we apply (B.8) and (B.12) to write the $C$ matrix in terms of Fourier-space
objects, and quickly obtain

$$C = \frac{\Delta\omega}{2\pi\Delta t}\,W^{-1}\,\tilde C\,W, \qquad (B.13)$$

where $\tilde C$ is the above mentioned Fourier-space representation of $C$, i.e., it is the matrix whose coefficients read

$$\tilde C_{jk} = \langle\tilde\eta_j\,\tilde\eta_k^*\rangle. \qquad (B.14)$$

Formula (B.13) is telling us that

$${\rm Det}(C) = \left(\frac{\Delta\omega}{2\pi\Delta t}\right)^N {\rm Det}(\tilde C) \qquad (B.15)$$

independently of $W$.



B.3 Power Spectrum 



Recall first the definition (4.6) of the covariance matrix: $C_{jk} = \langle\eta_j\eta_k\rangle$, where $\langle\cdots\rangle$ denotes statistical average
over realizations of the stochastic process $\eta$. For a stationary stochastic process we have that $C_{jk} = C_{j-k}$ can
only be a function of $j - k$, i.e. the covariance matrix is a Toeplitz matrix. It is a simple exercise to show that in
this case the covariance matrix in Fourier space $\tilde C$ is always diagonal:

$$\tilde C_{jk} = \langle\tilde\eta_j\,\tilde\eta_k^*\rangle \propto \delta_{jk}. \qquad (B.16)$$

The power spectrum is then defined as:

$$\tilde C_{jk} = P(\omega_j)\,\frac{\delta_{jk}}{\Delta\omega}, \qquad (B.17)$$

in analogy with the continuous definition:

$$\langle\tilde\eta(\omega)\,\tilde\eta^*(\omega')\rangle = P(\omega)\,\delta_{\rm Dirac}(\omega - \omega'). \qquad (B.18)$$
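The 'simple exercise' mentioned above can be illustrated numerically for the periodic (circulant) version of a stationary covariance. The Python sketch below builds such a matrix from one symmetric row, transforms it with $W_{mn} = e^{2\pi i\,mn/N}$, and checks that the off-diagonal elements vanish. It also compares the arithmetic mean of the diagonal (the one-point variance) with the geometric mean (the effective variance). The helper names, the example row and the $1/N$ normalization are assumptions chosen for this illustration.

```python
import cmath
import math


def circulant(first_row):
    """Covariance C[j][k] depending only on (k - j) mod N."""
    n = len(first_row)
    return [[first_row[(k - j) % n] for k in range(n)] for j in range(n)]


def to_fourier_space(C):
    """Return (1/N) * W C W^dagger with W[m][k] = exp(2*pi*i*m*k/N)."""
    n = len(C)
    W = [[cmath.exp(2j * math.pi * m * k / n) for k in range(n)] for m in range(n)]
    return [[sum(W[a][j] * C[j][k] * W[b][k].conjugate()
                 for j in range(n) for k in range(n)) / n
             for b in range(n)] for a in range(n)]


# Symmetric example row (hypothetical correlation falling off with lag).
row = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.125, 0.25, 0.5]
C_tilde = to_fourier_space(circulant(row))
N = len(row)

eigenvalues = [C_tilde[a][a].real for a in range(N)]
off_diagonal = max(abs(C_tilde[a][b]) for a in range(N) for b in range(N) if a != b)
sigma_1p2 = sum(eigenvalues) / N                                  # Tr(C) / N
sigma_e2 = math.exp(sum(math.log(x) for x in eigenvalues) / N)    # Det(C)**(1/N)

if __name__ == "__main__":
    print(off_diagonal, sigma_1p2, sigma_e2)
```

In this example the effective variance is smaller than the one-point variance, as expected for any correlated noise.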

B.4 Nonzero mean 

In the practical handling of data, it is sometimes necessary to introduce offsets, with the consequence that a
variable which had initially a zero mean may lose such a property. Assuming that $\langle\eta\rangle = 0$, let's suppose that
an offset $a \in \mathbb{R}$ is added to $\eta$. Thus, for the new variable $\eta' = \eta + a$ one has $\langle\eta'\rangle = a$. From the definition (4.6),
we find that the covariance matrix of $\eta'$ is just

$$C' = C + a^2 I, \qquad (B.19)$$

where $C$ is the covariance matrix of $\eta$, and $I$ is the identity matrix. Relating now covariance and power spectrum
by eqs. (B.13) and (4.15), or (B.17), one realizes that the new power spectrum is simply

$$P' = P + 2\pi\Delta t\,a^2 I, \qquad (B.20)$$

i.e., the previous one shifted by a constant, which corresponds to the entropy change coming from the knowledge
of $\langle\eta'\rangle = a$. Proceeding in this way, it is possible to use the same computational methods as for the zero-mean
case, with the only difference that $P$ has to be modified according to eq. (B.20).

C Appendix: The continuous random variable case

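As is well known, Shannon's entropy was first designed to deal with discrete random variables,

$$H(\eta) = -\sum_j p(\eta_j)\log_2[p(\eta_j)], \qquad (C.1)$$

where the index runs through all possible different countable values of the r.v. The problem with the continuous
r.v. is that different $\eta_j$'s do not form a partition. To define $H(\eta)$ we form first the discrete r.v. $\eta_\Delta$ obtained by
rounding off $\eta$:

$$\eta_\Delta = n\Delta\eta, \quad \text{if } n\Delta\eta - \Delta\eta < \eta \le n\Delta\eta. \qquad (C.2)$$

Clearly

$$P(\eta_\Delta = n\Delta\eta) = P(n\Delta\eta - \Delta\eta < \eta \le n\Delta\eta) = \int_{n\Delta\eta-\Delta\eta}^{n\Delta\eta} d\eta\,f(\eta) = \Delta\eta\,\bar f(n\Delta\eta), \qquad (C.3)$$

where $\bar f(n\Delta\eta)$ is a number between the maximum and minimum of $f(\eta)$ in the interval $(n\Delta\eta - \Delta\eta, n\Delta\eta)$.
Applying Shannon's definition, we have:

$$H(\eta_\Delta) = -\sum_{n=-\infty}^{\infty}\Delta\eta\,\bar f(n\Delta\eta)\log_2[\Delta\eta\,\bar f(n\Delta\eta)] \qquad (C.4)$$

and, since

$$\sum_{n=-\infty}^{\infty}\Delta\eta\,\bar f(n\Delta\eta) = \int_{-\infty}^{\infty} d\eta\,f(\eta) = 1, \qquad (C.5)$$

we conclude that

$$H(\eta_\Delta) = -\log_2(\Delta\eta) - \sum_{n=-\infty}^{\infty}\Delta\eta\,\bar f(n\Delta\eta)\log_2[\bar f(n\Delta\eta)]. \qquad (C.6)$$

As $\Delta\eta \to 0$, the r.v. $\eta_\Delta$ tends to $\eta$; however, its entropy $H(\eta_\Delta)$ tends to $\infty$ because $-\log_2\Delta\eta \to \infty$. This is
why we define the entropy $H(\eta)$ of $\eta$ not as the limit of $H(\eta_\Delta)$ but as the limit of the sum $H(\eta_\Delta) + \log_2(\Delta\eta)$
when $\Delta\eta \to 0$, i.e.:

$$H(\eta_\Delta) + \log_2(\Delta\eta) \to -\int_{-\infty}^{\infty} d\eta\,f(\eta)\log_2[f(\eta)] \quad \text{as } \Delta\eta \to 0. \qquad (C.7)$$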

So, the definition of 'entropy' for a continuous variable is:

$$H(\eta) = -\int_{-\infty}^{\infty} d\eta\,f(\eta)\log_2[f(\eta)], \qquad (C.8)$$

where the integration extends only over the region where $f(\eta) \neq 0$, as we have $f(\eta)\log_2[f(\eta)] = 0$ if $f(\eta) = 0$.
This 'entropy' is more usually called differential entropy in the literature and its definition can also be extended
to multivariate probability distributions. It is easy to see then that the above limit translates into:

$$H(\vec\eta_\Delta) + \log_2(\Delta\eta)^N \to -\int_{-\infty}^{\infty} d^N\eta\,f(\vec\eta)\log_2[f(\vec\eta)] \quad \text{as } \Delta\eta \to 0, \qquad (C.9)$$

when the $N$-dimensional space of $\vec\eta$ is latticed with $\Delta\eta$-boxes. So, we could approximate $H(\vec\eta_\Delta) \simeq -\log_2(\Delta\eta)^N + H(\vec\eta)$.
In our case we have defined the compression ratio as $c_{r,{\rm opt}} = \dfrac{\text{average length}}{h}$, where $h = \dfrac{H(\vec\eta_\Delta)}{N}$. If we
look back at the approximation:

$$h \simeq -\log_2(\Delta\eta) + H(\vec\eta)/N. \qquad (C.10)$$



The last summand in the previous expression is the average uncertainty per sample in a block of $N$ consecutive
samples. The limit $N \to \infty$ of it is what is known as differential entropy rate:

$$h(\vec\eta) = \lim_{N\to\infty}\frac{H(\eta_1,\dots,\eta_N)}{N}. \qquad (C.11)$$

So, if we imagine that we have an infinitely long stochastic process and $\vec\eta$ is a vector r.v. whose dimension
tends to infinity (i.e. $\eta_j = \eta(t_j)$ and we take samples for a long time, or just many samples), we could then
approximate:

$$h \simeq -\log_2(\Delta\eta) + h(\vec\eta). \qquad (C.12)$$

Regarding $h(\vec\eta)$ as the 'continuous part' of the entropy per component, i.e. $h_{\rm cont}$, this relation amounts



to eq. 
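The limiting procedure of this appendix can be illustrated numerically for a single Gaussian variable. The Python sketch below bins a Gaussian of width $\sigma$ into intervals of size $\Delta\eta$ (with exact bin probabilities obtained from the error function), computes $H(\eta_\Delta)$, and checks that $H(\eta_\Delta) + \log_2(\Delta\eta)$ approaches the differential entropy $\log_2(\sqrt{2\pi e}\,\sigma)$ as $\Delta\eta$ decreases. The function name and the truncation at 12 standard deviations are assumptions of this illustration.

```python
import math


def quantized_gaussian_entropy(d_eta, sigma=1.0, n_sigma=12.0):
    """Entropy in bits of a Gaussian rounded to bins (n*d_eta - d_eta, n*d_eta]."""
    n_max = int(n_sigma * sigma / d_eta) + 1

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

    H = 0.0
    for n in range(-n_max + 1, n_max + 1):
        p = cdf(n * d_eta) - cdf((n - 1) * d_eta)   # exact bin probability
        if p > 0.0:
            H -= p * math.log2(p)
    return H


def differential_entropy(sigma=1.0):
    """Differential entropy of a Gaussian, log2(sqrt(2*pi*e) * sigma)."""
    return math.log2(math.sqrt(2 * math.pi * math.e) * sigma)


if __name__ == "__main__":
    for d_eta in (1.0, 0.5, 0.1):
        shifted = quantized_gaussian_entropy(d_eta) + math.log2(d_eta)
        print(d_eta, shifted, shifted - differential_entropy())
```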



C.1 Entropy in the continuous case

For the one-dimensional Gaussian distribution in eq. (3.1) it is straightforward to show that

$$h(\eta) = h_{\rm cont} = \log_2\!\left(\sqrt{2\pi e}\,\sigma\right), \qquad (C.13)$$

in agreement with eq. (3.3) in the limit of small $\Delta\eta$, as expected from the comments in the previous section. For
the case of $N$-dimensional Gaussian noise with correlations, we can use the fact that $h(\vec\eta)$ is well-known
for a Gaussian stochastic process with power spectrum $\bar P(\omega)$:

$$h(\vec\eta) = \log_2\!\left[\sqrt{2\pi e}\right] + \frac{1}{4\pi}\int_{-\pi}^{\pi} d\omega\,\log_2[\bar P(\omega)], \qquad (C.14)$$

where $\bar P(\omega)$ refers to the discrete stochastic process derived from the continuous one, $P(\omega)$, by the relation

$$\bar P(\omega) = \frac{1}{\Delta t}\sum_{m=-\infty}^{\infty} P\!\left(\frac{\omega + 2\pi m}{\Delta t}\right), \qquad -\pi < \omega < \pi, \qquad (C.15)$$

where $\Delta t$ is the sampling interval that discretizes the process. For power spectra with a bandwidth limitation
this reduces to

$$\bar P(\omega) = \frac{1}{\Delta t}\,P\!\left(\frac{\omega}{\Delta t}\right), \qquad (C.16)$$

where $\bar P(\omega)$ refers to the process $\eta_n = \eta(t = n\Delta t)$. In this case we can do a simple change of variables $\omega' = \omega/\Delta t$
in eq. (C.14) to find:

$$\frac{1}{4\pi}\int_{-\pi}^{\pi} d\omega\,\log_2[\bar P(\omega)] = -\frac{1}{2}\log_2\Delta t + \frac{\Delta t}{2\pi}\int_0^{\pi/\Delta t} d\omega'\,\log_2[P(\omega')], \qquad (C.17)$$

where we have used the parity of $P(\omega)$ and the fact that the range in eq. (C.14) is symmetric. Recalling that
$\omega_{\rm Max} = \pi/\Delta t$ and that we are using $\omega_{\min} \simeq 0$, we can see that this calculation is equivalent to the Euler-Maclaurin
summation formula, eq. (4.27), so that the continuous calculation of the entropy given by eq. (C.14) and eq. (C.12)
yields identical results to those of the discrete calculation, eq. (4.12), in the limit of large $N$.